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Abstract 

Background: Biological pathways are important for understanding biological mechanisms. Thus, finding important 
pathways that underlie biological problems helps researchers to focus on the most relevant sets of genes. Pathways 
resemble networks with complicated structures, but most of the existing pathway enrichment tools ignore 
topological information embedded within pathways, which limits their applicability. 

Results: A systematic and extensible pathway enrichment method in which nodes are weighted by network 
centrality was proposed. We demonstrate how choice of pathway structure and centrality measurement, as well as 
the presence of key genes, affects pathway significance. We emphasize two improvements of our method over 
current methods. First, allowing for the diversity of genes' characters and the difficulty of covering gene importance 
from all aspects, we set centrality as an optional parameter in the model. Second, nodes rather than genes form 
the basic unit of pathways, such that one node can be composed of several genes and one gene may reside in 
different nodes. By comparing our methodology to the original enrichment method using both simulation data 
and real-world data, we demonstrate the efficacy of our method in finding new pathways from biological 
perspective. 

Conclusions: Our method can benefit the systematic analysis of biological pathways and help to extract more 
meaningful information from gene expression data. The algorithm has been implemented as an R package CePa, 
and also a web-based version of CePa is provided. 

Keywords: Pathway enrichment, Biological network, Centrality, Gene expression data 




Background 

As omics and high throughput technology continues to 
develop, researchers can increasingly understand bio- 
logical phenomena at the systems level; that is, can elu- 
cidate the complicated interactions between genes and 
molecules responsible for biological functions [1]. 
Microarray technology has been widely used to measure 
gene expression profiles and has produced huge 
amounts of data for biological analysis [2]. However, 
traditional single gene analysis tells us little about the 
cooperative roles of genes in real biological systems. 
New challenges for microarray data analysis are to find 
specific biological functions affected by a group of 
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related genes. Biological pathways are sets of genes or 
molecules that act together by chemical reactions, mol- 
ecule modifications or signalling transduction to carry 
out such functions [3]. Since pathways are essentially 
integrated circuits that actualize specified biological pro- 
cesses, perturbation of pathways may be harmful to 
regular biological systems. Thus, finding biologically im- 
portant pathways can assist researchers in identifying 
sets of genes responsible for essential functions. Cur- 
rently, amount of tools are available to identify which 
pathways are significantly influenced based on the tran- 
scription level change of member genes [4,5]. In other 
words, they identify pathways where differentially 
expressed genes are enriched. 

Since a pathway can be denoted as a set of genes, 
pathway enrichment belongs to a more general category 
of methods termed gene set enrichment. Two main 
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categories of enrichment methodologies exist: over rep- 
resentation analysis (ORA) and gene set analysis (GSA) 
[6]. The former only focuses on the number of differen- 
tial genes in the pathway, while the latter incorporates 
the entire gene expression from microarray datasets. In 
fact, ORA is a special case of GSA, utilizing a binary 
transformation of gene expression values. In standard 
ORA, the correlations between genes within the pathway 
and those that are differentially expressed are evaluated 
by Fisher's exact test or chi-square test, in form of a 
2x2 contingency table [7]. The most popular ORA on- 
line tool in current use is DAVID [8], which supports a 
variety of species and gene identifiers. For researchers fa- 
miliar with the R statistical environment, the GOstats 
package [9] is a highly recommended ORA analysis tool. 
GSA methods are implemented via either a univariate or 
a multivariate procedure [6]. In univariate analysis, gene 
level statistics are initially calculated from fold changes 
or statistical tests (e.g., f-test). These statistics are then 
combined into a pathway level statistic by summation or 
averaging [6]. GSEA [10] is a widely used univariate tool 
that utilizes a weighted Kolmogorov-Smirnov test to 
measure the degree of differential expression of a gene 
set by calculating a running sum from the top of a 
ranked gene list. Multivariate analysis considers the cor- 
relations between genes in the pathway and calculates 
the pathway level statistic directly from the expression 
value matrix using Hotelling's test [11] or MANOVA 
models [12]. Besides these standard models, extended 
models of GSA exist. For example, GSCA (Gene Set Co- 
Expression Analysis) [13] aims to identify gene sets 
whose members have different co-expression structures 
between phenotypes; ROAST [14] uses a Monte-Carlo 
simulation for multivariate regression which is applicable 
to diverse experimental designs; GGEA (Gene Graph 
Enrichment Analysis) [15] evaluates gene sets as Petri 
networks constructed from an a priori established gene 
regulatory network. Further studies have focused on the 
methodology issues of gene set enrichment analysis, 
such as evaluating the power of different statistical mod- 
els [6,16], generating null distributions of gene set scores 
[17,18], and overlapping of gene sets [19-21]. The ap- 
proach of gene set enrichment analysis is also applicable 
to a broad range of systems-biology-related fields, in- 
cluding functional network module analysis [22] and 
microRNA target prediction [23,24]. 

Current enrichment methods are limited for pathway 
analysis because they treat genes identical in pathways. 
Rather than comprising a list of genes, a pathway identi- 
fies how member genes interact with each other. Clearly, 
perturbation on a key gene will make more considerable 
effect for the pathway than on an insignificant gene. 
Since a pathway is represented as a network with nodes 
and edges, its topology is essential for evaluating the 



importance of the pathway. To date, few pathway en- 
richment studies have incorporated any topological in- 
formation. Gao et al. [25] designed a pathway score in 
which the values of all connected gene pairs are 
summed, where the value of a gene pair is obtained by 
multiplying the absolute normalized expression values of 
the paired genes. Hung et al. [26] defined a value for 
each gene based on the closest correlated neighbor 
genes, and assumed this value as the weight of the 
Kolmogorov-Smirnov test in GSEA procedure [10] for 
each pathway. Draghici et al. [27] introduced a topology 
term into the scoring function, reflecting that the im- 
portance of genes is enhanced if they in turn influence 
important downstream genes. Thomas et al. [28] 
assigned larger weights to upstream and downstream 
pathway genes, and to genes having high connectivity, 
and then integrated into the maxmean statistics [29]. 
Currently available methods determine the importance 
of genes in the pathway by a single measure. However, 
because of the complexity of biological pathways and 
the varying characteristics of genes, such single-measure 
quantitation cannot fully capture the properties of differ- 
ent genes on biological environment. Thus, a model that 
comprehensively integrates both enrichment and top- 
ology information is urgently required. 

Here, we propose a general, systematic and extensible 
enrichment methodology by which to find significant 
pathways using topology information. Two improve- 
ments of our method over current methods are appar- 
ent. First, given the diversity of genes' characteristics and 
the difficulties of covering gene importance from all 
angles, we do not assume a fixed measurement for each 
gene but allow the user to specify the method by which 
network nodes will be weighted, as an optional param- 
eter in the model. This feature enables researchers to as- 
sess gene importance from a perspective relevant to 
their particular biological problem. In our model, the 
importance of genes in pathways is assessed by network 
centralities. In graph theory, centrality provides a means 
of ranking nodes based on network structure. Different 
centrality measurements assign importance to nodes 
from different aspects. Degree centrality quantifies the 
number of neighbours to which a node directly con- 
nects, while betweenness defines the number of informa- 
tion streams passing through a given node. Generally 
speaking, large centrality values are assigned to central 
nodes in the network. Nodes representing metabolites, 
proteins or genes with high centralities are essential for 
maintaining biological networks in steady state [30,31]. 
Moreover, the relevance of a particular centrality meas- 
urement may vary according to the biological role of the 
pathway [32,33]. Choice of centrality measurement 
depends on the types of genes considered important in 
the pathway. Second, nodes rather than genes are taken 
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as the basic units of pathways in the model. In general, 
the regular biological functions in significant pathways 
are usually altered where abnormal pathway states arise 
from abnormal internal node states. We note that path- 
way nodes may represent not only single genes, but also 
complexes and protein families. For a complex compris- 
ing more than one gene, if one member gene has been 
altered, the function of the whole complex is disrupted. 
On the other hand, a single gene may reside in multiple 
complexes; if this gene loses its function, all of its com- 
plexes will be influenced. Therefore a mapping proced- 
ure from genes to pathway nodes is applied in our 
model. The pathway nodes further include non-gene 
nodes such as microRNAs and compounds, which also 
contribute to the topology of the pathway. Hence, all 
types of nodes are retained in our pathway analysis. 

In this article, the original pathway enrichment 
method is extended by assigning network centralities as 
node weights, and nodes are mapped from differentially 
expressed genes in pathways. The model is flexible in 
that it can readily accommodate available gene set en- 
richment methods and various topological measure- 
ments. Through a simulation study, we demonstrate 
how pathway significance depends on network structure 
and choice of centrality measurement. In the analysis of 
liver cancer data set, our model identified relevant bio- 
logical processes which were bypassed using existing 
methods. We also demonstrate how key genes affect the 
significance of pathways directly underlying biological 
processes. 

Results and discussion 

Because ORA methodology is easily implemented and 
rapidly executed, it is favored over GSA in applications 
[8]. Therefore, we focus mainly on the centrality-based 
extension of ORA, while the extension of GSA will be 
discussed briefly at the end of this article. 

Mapping genes to nodes 

Since a pathway represents as a network, the basic unit 
of the network (the node) is not always a single gene. In 
real biological pathways, the nodes can also represent 
complexes or protein families. Moreover, the product of 
a particular gene may be incorporated into different 
complexes to serve different functions. Such diverse 
roles of gene products are ignored by traditional ORA 
methods, possibly leading to erroneous interpretations. 
Abnormal node states are expected to contribute to the 
abnormal states of pathways. As previously mentioned, 
the function of a multi-gene complex is affected by alter- 
ation of any one gene in the complex, while alteration of 
a multi-complex gene influences all of the complexes in 
which the gene resides. Merely counting genes in path- 
ways cannot reflect these different types of roles played 



by different genes. In a real-world pathway catalogue, a 
node typically comprises two or more genes, and some 
genes locate in multiple complexes or families. Among 
pathways in the NCI-Nature catalogue of Pathway Inter- 
action Database (PID) [34], 58.6% of nodes comprise 
more than one gene while 47.2% of genes reside in mul- 
tiple nodes (Figure lA, IB). Compounds and micro- 
RNAs can also form pathway nodes. Although the 
changing quantity of these entities is not captured by 
typical microarray experiments, they may contribute sig- 
nificantly to pathway regulation. Therefore, these types 
of nodes cannot be neglected in topological pathway 
analysis. For the above reasons, the number of genes 
involved in a biological pathway does not correspond to 
the number of nodes in the pathway. Figure IC shows 
how node count varies with gene count in pathways 
extracted from PID. Therefore, in our analysis we map 
genes to the pathway nodes and assume the node as the 
basic pathway unit. In this way, if any member of a com- 
plex or family is differentially expressed, the node repre- 
senting the complex or family is differentially affected. 
We consider that nodes representing protein coding 
genes, compounds and microRNAs are all legitimate 
regulators of pathways. 

Pathway score 

In any pathway enrichment framework, the significance 
of a pathway is evaluated by a pathway-level statistic. 
For example, in ORA, the pathway-level statistic is the 
number of differentially expressed genes in a pathway. 
To account for the varying positions of genes within 
pathways, we introduce a new statistic (here called the 
pathway score), defined as the summation of the weight 
of differentially affected nodes in the pathway: 

n 

s = ^ Widi (1) 
1=1 

, f 1 differentially affected 
^'■=\0 else 

where s is the pathway score, w, is the weight of the 
node (reflecting the importance of the node), n is the 
number of nodes in the pathway, and (i, identifies 
whether the node is differentially affected. The path- 
way score is the aggregate of two components, the 
weight and the number of differential nodes. Therefore, 
if a node has larger weight, i.e. is more important, it 
more strongly determines whether the pathway is signifi- 
cant. On the other hand, large numbers of differential 
nodes also increase the pathway score. Consequently, a 
significant pathway may contain a few highly important 
nodes, while an insignificant pathway contains many 
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Figure 1 Meta-analysis of the pathway catalogue. A) Distribution of the number of member genes in eacli node; B) Distribution of the 
number of nodes in which a single gene resides; C) Relationship between node count and gene count in biological pathways. The pathways are 
derived from Pathway Interaction Database, NCI-Nature catalogue. For figure A and B, points are log-scaled on the Y-axis. 



non-significant differential nodes. In Equation 1, the def- 
inition of w is general and the weight can be assigned 
any value the researcher considers appropriate. Note 
that when w, = 1 for all i, s is simply the number of dif- 
ferential nodes in the pathway. We refer to this condi- 
tion as the equal weight condition in the following 
section. 

Centrality measurements 

The most important information in pathways comprises 
the complicated interactions between genes that govern 
the transmission of biological signals through networks. 
Since pathways present as networks, it is natural to de- 
fine the weight w from topological information. In exist- 
ing methods using topological information, various 
aspects of gene importance are assigned fixed values. It 
is noteworthy that, because genes play different roles in 
biological pathways, it is difficult to design measure- 
ments that cover the entire spectrum of a gene's func- 
tion. Instead of designing single measurements, we 
compute various topological measurements that meas- 
ure the importance of genes from different aspects. 
Since different measurements relate to different bio- 
logical functions, the best practice is to try every choice 
in the search for significant pathways. 

Here, we identify central nodes in pathways using net- 
work centrality. Recall from the Background section that 
centrality in graph theory is a means of ranking nodes 
according to network structure. Two frequently-used 
centralities, degree and shortest path betweenness (or 
more concisely, betweenness), are selected as candidate 
measurements. Since pathways are directed networks, 
degree centrality is denoted as in-degree and out-degree. 



In biological networks, in-degree refers to the number of 
upstream genes directly acting on a given gene, while 
out-degree refers to the number of downstream genes 
directly acted upon by the gene. As previously men- 
tioned, betweenness assesses the amount of information 
streaming through a given node in the network. These 
two centralities are broadly used in biological network 
analysis [31,35]. 

To measure the importance of nodes in the network 
from different aspects, we define an additional centrality: 
largest reach. The largest reach centrality is based on the 
shortest path between two nodes and is affected by all 
the other nodes in the network. The largest reach cen- 
trality determines how far a node can send or receive in- 
formation within the network. It is defined as the largest 
length of the shortest paths to all the other nodes in the 
network. Since information is always transmitted se- 
quentially in biological pathways, the largest reach cen- 
trality can reflect whether nodes stay in the upstream or 
downstream part of the pathway. In a directed network, 
the largest reach is denoted as in-largest reach and out- 
largest reach. 

Other centralities, besides those described above, can 
also be considered. For instance, the closeness centrality 
computes the time required to spread information from 
one node to all other nodes. The eccentricity centrality 
determines whether a node resides in the center of the 
network and whether the distribution of nodes around 
the central node is symmetric. The stress centrality mea- 
sures the extent to which a node can hold network com- 
munications. The eigenvector centrality measures the 
importance of a node based on its connections to other 
high-scoring nodes in the network (which contribute 
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more to the node score than low-scoring nodes). Cen- 
trahties closely related to the eigenvector are Katz's Sta- 
tus Index and PageRank. For more details on this 
subject, readers may refer to [32,33,36]. 

Simulation study 

A novel gene list and a novel pathway are generated in 
the simulation study. In the pathway, we assume that 
every node corresponds to a single gene. The contin- 
gency table for ORA is listed in Table 1. The jj-value of 
the pathway (1.36 x 10"^ by Fisher's exact test, one sided) 
is constant and independent of pathway structure. 

The structure of the pathway is generated as random 
networks. Two representative random network models, 
Erdos-Renyi model [37] (abbreviated to ER) and 
Barabasi-Albert model [38] (abbreviated to BA), are 
selected. These models are the basic random network 
models in graph theory but their network structures dif- 
fer. We generate ER random networks as follows: 1) 
Each pair of nodes has the same probability (l/«) to be 
connected, where n is the number of nodes in the path- 
way; 2) Each connection can choose a direction with 
equal probability {p = 0.5). The BA random network is 
generated as follows: 1) The probability that a node will 
make a new connection is proportional to its degree; 2) 
Each connection can choose a direction with equal prob- 
ability {p = 0.5). In the ER model, node degree follows a 
binomial distribution; while in BA model it follows a 
power law distribution. In the BA model, the majority of 
nodes have few neighbors while a small minority holds 
most connections in the network. Examples of ER and 
BA random networks can be found in Additional file 1. 

The structure of the pathway was generated for 1000 
times, and 40 differential nodes were randomly selected 
from each simulated network. For each simulated net- 
work, we calculate the significant of the pathway. Values 
of in-degree, out-degree, betweenness, in-largest reach, 
out-largest reach centralities, as well as the equal weight 
condition, are compared between our method and trad- 
itional ORA. Note that since every node corresponds to 
a single gene, the equal weight condition approximates 
to the hypergeometric distribution, on which traditional 
ORA is based [7]. 

Since the pathway score is computed from a list of dif- 
ferential nodes, we measure the approximate 



Table 12x2 contingency table for ORA 





In the pathway 


Else 


Total 


Differential 


40 


960 


1000 


Else 


160 


8840 


9000 


Total 


200 


9800 


10000 



The simulated microarray contains 10000 genes of which 1000 genes are 
differentially expressed. The novel pathway contains 200 genes of which 40 
are differential genes. 



distribution of the differential nodes' centrality in each 
simulation by four values: maximum, median, minimum 
and 75* quartile. From these four values, the effect of 
the differential nodes' centralities on the final pathway 
score can be estimated. Figure 2 illustrates /^-values and 
distribution of centralities of differential nodes in each 
simulation under different centrality measurements. The 
proportions of the pathway with /(-values < 0.01 are listed 
in Table 2. Clearly, the significance of the pathway is lost 
when centrality is used as a weighting factor, and levels 
of pathway significance depend on network structure 
and type of centrality measure. For example, in an ER- 
generated network structure in which nodes are 
weighted by in-degree, the proportion of being signifi- 
cant for the pathway is 57.4% out of 1000 simulations. 

When using degree (in and out) as the weight, the ER 
model outputs a larger proportion of significant path- 
ways than does the BA model. In BA, a small minority 
of important nodes (measured by degree) dominates the 
pathway; hence, if differential nodes are randomly picked 
from a BA network, the probability of selecting those 
nodes which yield large pathway scores is low. The ma- 
jority of trials, therefore, generate insignificant pathways. 

It is observed that maximum largest reaches (in and 
out) from both ER and BA networks are similar (around 
10; see Figure 2), but the median values and the 75"^ 
quartile of largest reach in the BA-generated network 
exceed those of the ER-generated network, implying that 
the distribution of largest reach in BA model is right 
shifted relative to that of the ER model (The histograms 
of the largest reach in both models can be found in 
Additional file 2). As a result, when using largest reach 
as weight, the BA model produces a higher proportion 
of significant pathways than does the ER model. This is 
due to the presence of central hub nodes in the BA 
model, which ensure robust information transmission 
and are thus more likely to score high largest reach 
values. 

From the simulation study, we observe that although 
the number of differential nodes in a pathway is significant 
by Fisher's exact test (or by its approximation, the equal 
weight condition), the pathway will not be significantly 
affected if these genes hold less important positions in the 
pathway. The level of significance is affected by both cen- 
trality measurements and network structure. If researchers 
consider that nodes with large degree will be more import- 
ant, without considering the network topology, traditional 
ORA would yield large false positives. In the current simu- 
lation study, the proportion of significant pathway under 
ORA is expected to be 100%; but, when the structure of 
the pathway is assembled by the ER model and assessed 
by degree centrality, there are only 57.4% significant path- 
ways from 1000 simulations. It means there would be 
42.6% false positives from above perspective. 
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Figure 2 (See legend on next page.) 
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(See figure on previous page.) 

Figure 2 P-values and centrality distributions of pathways with different random network structures under different centrality 
measurements. Patiiway topologies are generated from (A) Erdos-Renyi model and (B) Barabasi-Albert model. Comparisons are made between 
in-degree, out-degree, betweenness, in-largest reach, out-largest reach centralities, as well as the equal weight condition. Each plot represents the 
distribution of differential nodes centralities in each simulation, assessed by maximum value, the 75th quartiie, median value and minimum value. 
All data are ordered by p-values on the X-axis. Points in the figure are randomly shifted by small intervals for ease of visualization. 



Influence of key nodes 

We next assess the influence of the key nodes in the evalu- 
ation of pathway significance. For the same novel gene list 
and novel pathway as were used in the simulation study, 
the number of differential nodes in the pathway is varied 
from 1 to 100. The pathway structures are generated from 
the BA model with no directions, and degree is used as the 
centrality measure. Differential nodes may be integrated 
into the pathway via two approaches; 1) from largest to 
smallest degree, and 2) from smallest to largest degree. 

In the BA model the small number of nodes holding 
most connections are the most central nodes, thus they 
contribute majorly to the significance of the pathway. The 
pathway would be altered if these nodes were differentially 
affected. As illustrated in Figure 3, when selecting high- 
degree differential nodes, provided that the number of dif- 
ferential nodes is 5 or greater, the pathway is highly signifi- 
cant ip-vahie < 0.01). By comparison, pathways generated 
from 5 differential nodes by traditional ORA are far from 
significant (p-valuew 1). Applying ORA, the minimum 
number of differential nodes required to achieve j?-value < 
0.01 is 31. On the contrary, if differential nodes in the 
pathway are largely of very low degree, many more of 
these nodes are required to make the pathway significant. 
As shown in Figure 3, at least 90 small-degree differential 
nodes must be selected to render the p-value of the path- 
way less than 0.01. In conclusion, considering the number 
of differential nodes alone cannot fully reflect the signifi- 
cance of the pathway. We reiterate that without highlight- 
ing these key nodes, researchers are likely to make 
erroneous interpretations of biological pathways. 

Real-world data analysis 

We tested our method on a real microarray dataset 
[GEO: GSE22058] [39]. The microarray experiment 

Table 2 Proportion of pathways with p-values < 0.01 in 
simulation study 



Centrality 


ER model 


BA model 


Equal weight 


1.000 


1.000 


In-degree 


0.574 


0.383 


Out-degree 


0.574 


0403 


Betweenness 


0.134 


0.081 


n-largest reach 


0.493 


0.767 


Out-largest reach 


0.448 


0.745 



measures mRNA expression changes in liver cancer 
tissue and adjacent non-tumour tissue. Following gene 
ID matching and duplicated gene merging, 18503 genes 
were obtained. The top 2000 most differentially 
expressed genes (determined by i-test) comprised our 
differential gene list. NCI-Nature pathway catalogue 
from Pathway Interaction Database (PID) [34] was used 
because it is manually curated and reviewed, and is 
highly recommended by the PID database. In-degree, 
out-degree, betweenness, in-largest reach and out-largest 
reach centrality measurements were applied and com- 
pared. In addition, we applied the dataset to equal 
weight condition and traditional ORA because the equal 
weight condition maps genes to nodes, while traditional 
ORA focuses solely on gene number. P-values for path- 
ways are calculated from 1000 simulations and the false 
discovery rate (FDR) is calculated by Benjamini- 
Hochberg (BH) process [40]. Cutoff for FDR is set to 
0.05. 

Figure 4 illustrates the heatmaps of the FDRs of path- 
ways generated under different centrality measurements. 
A complete list of p-vahies and FDRs is tabulated in 
Additional file 3 and Additional file 4. Among the 11 
pathways for which our method agrees with traditional 
ORA using at least one centrality, the PLK pathway, 
MET pathway and MAPK pathway are directly related 
to liver cancers [41,42]. MAPK pathway is significant 
when nodes are weighted by in-largest reach {p-vahie = 
0.001, FDR = 0.025), consistent with expected biological 
phenomena. The differential nodes are mainly located in 
the downstream of the pathway; that is, transcriptional 
factors (e.g. FOS) or cell cycle related factors (e.g. CDK5 
and CD5R1), while few of the upstream genes are 
included in our differential gene list. As the MAPK path- 
way is essentially a cascade of sequential interactions 
[43], weighting its nodes by out-largest reach renders it 
insignificant, whereas weighting by in-largest reach, 
which gives larger weight to the downstream nodes, 
marks the pathway as significant (Figure 5). In other 
words, if the pathway is rendered significant by in- 
largest reach weighting, we can infer that the down- 
stream nodes are differentially affected. 

Among 8 pathways evaluated as insignificant by trad- 
itional ORA but significant by centrality-based methods, 
four have been previously linked to liver cancers 
[42,44,45]. AP-1 pathway is assessed as insignificant by 
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low degree and from low to high degree. Also, traditional ORA was applied for comparison. 
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Figure 5 Summary of MAPK-TRK pathway generated under in-largest reach centrality. A) Distribution of in-largest reach centrality of 
differential nodes in the simulated pathway. The distribution of differential nodes centralities in each simulation is assessed by maximum value, 
the 75th quartile, median value and minimum value; B) Distribution of in-largest reach centrality of all nodes in the real pathway; C) Histogram of 
simulated scores in the pathway; D) Graph view of the pathway where the size of a node is proportional to its centrality value and nodes in red 
represent differential nodes. In figures A and B, dots are randomly shifted by small intervals for ease of visualization. In figures A and C, the real 
pathway score Is marked with a red line. 



traditional ORA because, of the 70 genes involved in the 
pathway, only 15 are differential. However, after map- 
ping genes to the pathway nodes, we obtain 55 differen- 
tial nodes among 114 pathway nodes. Because two key 
genes, FOS and JUN [46,47], combine with a host of 
other genes to form activated complexes in the pathway, 
the mapping procedure increases the number of posi- 
tions that these two genes occupy in the network. There- 
fore the AP-1 pathway becomes more significant under 
equal weight condition than under traditional ORA. As 
another example, the VEGF receptor (VEGFA) is a prin- 
cipal component in the VEGFRl and VEGFR2 signaling 
pathway. As a membrane protein, VEGFA receives large 
quantities of extracellular information and disseminates 
it into intracellular proteins [48]. VEGFA requires- 
VEGFR2 to form an activated complex, hence the repre- 
sentative node possesses high values of both in-degree 
and out-degree, and the degree-weighted pathway is ren- 
dered significant (^-value = 0.002, FDR = 0.034 for in- 



degree; p-val\ie = 0.007, FDR = 0.104 for out-degree). On 
the other hand, VEGFA itself is not differentially 
expressed, but its companion gene VEGFR2 is. Conse- 
quently, an abnormal state of the member gene results 
in a dysfunctional complex. This type of circumstance, 
which cannot be inferred by traditional ORA, empha- 
sizes why nodes, rather than genes, should form the 
basic units in pathway analysis. 

Conclusions 

Pathway analysis can assist researchers to understand 
biological aberrations at a systems level. The functional- 
ity of biological pathways depends upon complex gene 
interactions. Therefore, pathway enrichment tools should 
highlight genes that play important roles in the pathway 
from the view of topology. Here we proposed a systematic 
and extensible methodology, which finds significant path- 
ways using network centrality to weight the nodes. We 
demonstrated that levels of pathway significance depend 
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on choice of pathway structure and centrality measure. 
The method performed favorably when applied to real- 
world data. 

Centrality can reflects the central nodes in a pathway, 
and different centralities assign gene importance from 
different aspects. The use of centralities in biological 
networks can aid in explaining biological phenomena. In 
this work, we demonstrated the advantages of using 
multiple centrality measurements to obtain a complete 
view of the system. Pathway nodes, rather than genes, 
should form the basic units in pathway analysis, since 
many genes must aggregate as complexes in order to 
function completely. The focus on pathway nodes 
accommodates the fact that genes can be members of 
complexes or families, or may exist in many complexes. 
Finally, it should be noted that a high quality and non- 
redundant pathway structure dataset is required. Pro- 
jects like BioPAX [49], which aspire to the integration 
and exchange of biological pathway data, will greatly as- 
sist future pathway analysis. 

Our method can reveal new findings that relate to, 
and can aid the understanding of, current biological 



problems. We consider that our method will become a 
valuable tool in the systematic analysis of biological 
pathways, and will help to extract more meaningful in- 
formation from gene expression data. 

Methods 

To implement the method, a list of differential genes 
and a list of background genes, both formatted with 
gene identifier (e.g. gene symbol or RefSeq ID), is 
required. A list of pathways and their topology, and a 
means of mapping genes to pathway nodes, is also 
required. In this study, 223 NCI-Nature pathways from 
PID (released September 9* 2011) are included. The 
pathway data are parsed from XML format file provided 
by the PID FTP site. The Perl code for parsing can be 
obtained from the author's website (http://mcube.nju. 
edu.cn/jwang/lab/soft/cepa/). The general workflow of 
the method is illustrated in Figure 6. 

Generate mapping data 

PID provides mappings from UniProt ID to node id. In 
this study, gene symbol is selected as the primary 
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identifier ID. The mapping from gene symbol to HGNC 
ID [50] (accomplished via the online "custom down- 
loads" tool in HGNC database) and the mapping from 
HGNC ID to UniProt ID [51] (using idmapping.dat.gz 
on the UniProt FTP site) are first extracted. The final 
mapping from gene symbol to node id is generated by 
merging the above three kinds of mapping data. 

Centrality measurements 

Two commonly used centralities, degree and shortest path 
betweenness, are selected as initial candidate measure- 
ments. Degree centrality quantifies the number of neigh- 
boring nodes to which the node of interest is directly 
connected, while betweenness centrality measures the 
amount of information streaming through a given node. 

To measure the importance of nodes in the network from 
multiple aspects, we defined an additional centrality: largest 
reach. This centrality is based on the shortest path between 
two nodes and the value of the centrality is affected by all 
other nodes in the network. The largest reach centrality 
measures how far a node can send or receive information. 
It is defined as the largest length of the shortest path from 
node V to all other nodes in the network (see Equations 3 
and 4 where d{w, v) refers to the shortest path length be- 
tween nodes v and w). In a directed network, this measure 
is denoted as in-largest reach or out-largest reach. 



max {d{w, v)} 



wev 



(3) 
(4) 



Users of our system can replace the provided centrality 
measures with their centrality measurements of interest. 
It is recommended that centrality choice is guided by 
biological plausibility and/or reality. 

Pathway score 

The score is defined as the summation of the weights of 
differentially affected nodes in the pathway 



^ Widi 



, _ J 1 differentially affected 
"^'"10 else 



(5) 
(6) 



where s is the score of the pathway, w, is the weight of 
the i"^ node and reflects the importance of the node, n is 
the number of nodes in the pathway, and di identifies 
whether the i"^ node is differentially affected or not. 

In our model, we weight the nodes by network central- 
ity. Because the network centrality can be zero, an add- 
itional term is added to the weight measure. In Equation 
7, a is a small positive number to ensure that all weights 
are positive, a is chosen to exert marginal effect upon 



the weight. The default value of a is 1/100 of the mini- 
mum non-zero weight. 

= Wi + a (7) 

Theoretical distribution of the pathway score 

To calculate the theoretical distribution of the pathway 
score, we assume that every node is a single gene and that 
our model satisfies the following conditions: 1) genes are 
independent; 2) w and d are random variables; 3) w and d 
are independent; 4) w follows a particular discrete or con- 
tinuous distribution and li is a Bernoulli random variable. 
Thus the pathway score, denoted as S, is also a random 
variable. These conditions are formally expressed as 



w P„{w) 

P{d=l) = l- P{d = 0) = pdiff 



(8) 
(9) 



where p^yf is the probability that a gene is differentially 
expressed. It is calculated as the proportion of differen- 
tially expressed genes on the microarray. 



ribg 



(10) 



Within a pathway of score s, assume that k differential 
genes {d = 1) and n-k non-differential genes {d = 0) exist, 
so that 5 can be written as 



Wi 



Wk + Ok+i 



0„ 



(11) 



Thus the probability that pathway score S is equal to 
or larger than s is 



P{S>s) = J2 



Pdiffi'^ - Pdiff) 



E 



Wi>S 



(12) 



The binomial term of equation 12 is the probability of 
obtaining k differential genes from n genes, and the sec- 
ond term is the probability that the sum of k differential 
genes' weight is equal to or larger than s. The final prob- 
ability P{S > s) is the summation over all conditions of k. 

Since genes are independent, provided that Pw{w) is 
known, the distribution of the summation of w can be 
calculated. For instance, given a pathway with ER ran- 
dom network structure in which nodes are weighted by 
degree, w will follow a binomial distribution and thus P 
(Z,w,) also follows a binomial distribution. 

Non-parametric null distribution of the pathway score 

In applications, because the weight distribution is not easily 
determined and nodes are not independent after the map- 
ping procedure, the theoretic distribution is difficult to cal- 
culate. A non-parametric null distribution of s can be 
generated through simulation. For every gene in a pathway, 
we guess whether it is differentially expressed. Similar to 



Gu et al. BMC Systems Biology 2012, 6:56 
http://www.biomedcentral.eom/1752-0509/6/56 



Page 12 of 13 



throwing a coin, we assume that each gene has a probabil- 
ity Pdiff (calculated by Equation 10) of being differentially 
expressed. In each simulation, we obtain a list of simulated 
differentially expressed genes in the pathway. This simu- 
lated differential gene list is then mapped to the pathway 
nodes. The pathway structure is unchanged and the simu- 
lated pathway score is re-calculated from Equations 5 and 
6. The significance is calculated as the proportion of the 
simulated score exceeding the real score (Equation 13). 

p = #{s'''""'"''>s} /#{simulation} (13) 
Extension on GSA 

The ORA centrality-based enrichment method yielded 
plausible, biologically relevant results in the simulation 
study and real-world data analysis. However, an oft- 
mentioned drawback of ORA is that an objective cutoff is 
appointed in the acquisition of a differential gene list, with 
the following consequences: 1) The resulting pathway or 
network may be sensitive to the cutoff [52]. In the 
centrality-based extension of ORA, when a high-scoring 
node is marginally close to the imposed cutoff, this effect 
can be critical; 2) In some circumstances, differential genes 
are too few to apply ORA [53]; 3) Binary transformation of 
expression data leads to loss of information. To address 
these issues, researchers have developed the GSA frame- 
work, which utilizes all gene expression values. Like trad- 
itional ORA however, GSA assumes that genes in pathways 
occupy unvarying positions in the topological structure. We 
propose that our centrality-based enrichment methodology 
can be similarly extended on GSA. In this section, we sug- 
gest, but do not implement, a conceptual methodological 
extension to the GSA method. 

In the traditional univariate GSA procedure, the score 
s of the pathway is defined as: 

s =/(g) (14) 

where / transforms the gene-level statistic to a pathway- 
level statistic (e.g. by summation, averaging) and g is the 
gene-level statistic vector which typically comprises i-values 
[6,10,52]. In ORA, g is a binary variant andy{g) is summa- 
tion. In our model to extend GSA, gene-level statistic is first 
transformed to node-level statistic. We define the vector of 
the node-level statistics as d. When nodes in pathways com- 
prise multiple genes, the node-level statistic can be consid- 
ered as the maximum value of the corresponding member 
genes. Using centrality as the weight, the score is defined as 

s=/(wd) (15) 

where w is the weight vector and the transformation func- 
tion/acts upon the product of w and d. Equation 15 incor- 
porates centrality weight into the original node-level 
statistic. To prevent w from overpowering d (or vice versa) 
when both vectors contain continuous variables, we 



propose that w and d should be normalized. The null distri- 
bution of the pathway score could then be generated by per- 
muting the gene expression matrix. 

Implementation 

The method proposed in the article has been implemen- 
ted in an R package named CePa which is available at 
CRAN (http://www.r-project.org/). In the CePa package, 
four pathway catalogues, namely NCI-Nature, BioCarta, 
Reactome and KEGG from PID, have been integrated. 
Centrality calculation and network visualization are pro- 
cessed by igraph package [54]. A web-based version of 
CePa is available for researchers who are not familiar 
with R programming (http://mcube.nju.edu.cn/cgi-bin/ 
cepa/main.pl), in which Cytoscape Web is used for net- 
work visualization [55]. 

Additional files 



Additional file 1: Two random pathways generated from ER model 
and BA model. Number of nodes in both networks is 200. 

Additional file 2: IHistograms of in-largest reach aggregated from 
1000 networl<s generated either by ER model or BA model. 

Additional file 3: The complete list of p-values of pathways 
generated under different centrality measurements. 

Additional file 4: The complete list of FDRs of pathways generated 
under different centrality measurements. 
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